Data set: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv
Data set description: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
The data set structure:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Summary of the data set:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
We can see that the variable X is just a running index of the observations, so we drop it for the sake of clarity.
redwine$X = NULL
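As a minimal sketch of what this assignment does, the snippet below rebuilds a tiny stand-in for `redwine` from the first three rows shown in the structure output above (only three of the columns are copied). In the full analysis `redwine` would presumably come from reading the CSV linked at the top; that loading step is not shown in the report, so it is left out here.

```r
# Stand-in data frame: first three rows of three columns,
# copied from the str() output above (not the full data set).
redwine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
redwine$X <- NULL

names(redwine)
```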
Let’s explore each individual variable. We start with quality since it is the main feature of interest.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The distribution of this variable looks approximately normal with a slight left-skewness. More than 1300 wines (which is over 80% of all wines) received a score of either 5 or 6.
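The share quoted above can be checked directly from the frequency table. The following sketch re-enters the six counts by hand (the variable names are illustrative, not taken from the original code) and recomputes the total, the share of wines rated 5 or 6, and the mean score.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

n_wines <- sum(quality_counts)

# Share of wines with a score of 5 or 6.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_wines

# Mean score as a count-weighted average of the score values.
scores <- as.numeric(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / n_wines

round(c(n = n_wines, share = share_5_or_6, mean = mean_quality), 3)
```

The total matches the 1599 observations, and the share of 5s and 6s is 1319 of 1599 wines, which is a little over 82%.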
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
This variable ranges from 4.6 to 15.9 with a mode of 7.2. The distribution is roughly bell-shaped with a right skew, and there are a few outliers on the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
This plot has characteristics similar to the previous one. There are several outliers on the right, and the distribution is again roughly bell-shaped with a right tail. However, this variable has less variance than the fixed.acidity feature.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
It is hard to tell what kind of distribution this is. The mode at 0 is striking, as well as the outlier at 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of this variable has a long tail, so we apply a log10 transformation on the x-axis.
The result is roughly bell-shaped but still right-skewed.
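To see why a log10 scale helps with the long tail, one can compare how far the maximum and the minimum lie from the median before and after the transformation. This is a small illustration using only the minimum, median, and maximum from the summary above; `tail_ratio` is a helper introduced here for the example, not part of the original analysis.

```r
# Minimum, median and maximum of residual.sugar, from the summary above.
sugar <- c(min = 0.9, median = 2.2, max = 15.5)

# Length of the upper tail relative to the lower tail.
tail_ratio <- function(q) {
  unname((q["max"] - q["median"]) / (q["median"] - q["min"]))
}

ratio_raw <- tail_ratio(sugar)         # on the original scale
ratio_log <- tail_ratio(log10(sugar))  # on the log10 scale

c(raw = ratio_raw, log10 = ratio_log)
```

On the raw scale the upper tail is roughly ten times as long as the lower one; on the log10 scale it is only about twice as long, so the bulk of the data is no longer squeezed against the left edge of the plot.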
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
It looks like this distribution has a tail to the right combined with some extreme outliers beyond it. We display two additional plots for comparison. The first uses a log10 transformation, while the second cuts off the top 3% of the chlorides values.
The distributions are similar, i.e. approximately normal with some tail to the right.
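Cutting off the top 3% can be done with a quantile threshold. The report does not show how the cut was implemented, so the following is only one possible sketch: `trim_top` is a hypothetical helper, and the vector `1:100` is a stand-in used purely to make the behaviour easy to follow.

```r
# Keep only the values at or below the (1 - p) quantile.
trim_top <- function(x, p = 0.03) {
  cutoff <- quantile(x, probs = 1 - p, names = FALSE)
  x[x <= cutoff]
}

# Stand-in data (not the chlorides values): the integers 1 to 100.
stand_in <- 1:100
trimmed <- trim_top(stand_in)

length(trimmed)
max(trimmed)
```

With the real data the call would be along the lines of `trim_top(redwine$chlorides)`, and the trimmed vector could then be passed to the histogram.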
##
## 1 2 3 4 5 5.5 6 7 8 9 10 11 12 13 14
## 3 1 49 41 104 1 138 71 56 62 79 59 75 57 50
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 78 61 60 46 39 30 41 22 32 34 24 32 29 23 23
## 30 31 32 33 34 35 36 37 37.5 38 39 40 40.5 41 42
## 16 20 22 11 18 15 11 3 2 9 5 6 1 7 3
## 43 45 46 47 48 50 51 52 53 54 55 57 66 68 72
## 3 3 1 1 4 2 4 3 1 1 2 1 1 2 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
This variable follows a right-skewed distribution with some outliers to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Again, a right-skewed distribution, this time with two extreme outliers. Let’s refine the plot by removing the outliers and adjusting the binwidth.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density is the first variable in this data set that follows an almost textbook-like normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is another normally distributed variable, with some outliers on both sides.
##
## 0.33 0.37 0.39 0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
## 1 2 6 4 5 8 16 12 18 19 29 31 27 26 47
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
## 51 68 50 60 55 68 51 69 45 61 48 46 41 42 36
## 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
## 35 23 33 26 28 26 26 20 25 26 23 18 19 15 22
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97
## 15 13 14 13 13 7 7 8 8 5 10 4 2 3 6
## 0.98 0.99 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.1 1.11 1.12
## 2 3 1 1 3 2 2 3 4 2 3 1 2 1 1
## 1.13 1.14 1.15 1.16 1.17 1.18 1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56
## 2 2 1 1 5 3 1 1 1 2 1 1 1 3 1
## 1.59 1.61 1.62 1.95 1.98 2
## 1 1 1 2 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
A right-skewed distribution with some outliers to the right. We exclude them for our refined plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of this variable is right-skewed.
After dropping X, the data set contains 1599 observations with 12 variables on the properties of the wine. Quality is a discrete score that is stored as an integer and treated as categorical here, while the remaining features are numerical.
The main feature is quality, an integer variable measured from 0 (worst) to 10 (best). There are no wines rated with a quality of 0, 1, 2, 9, or 10 in this particular data set. We will examine if and how the other features influence the quality of the wine.
Intuitively, I assume that alcohol, residual.sugar, and acidity have the most influence on the quality of a wine. First, alcohol serves as a flavor carrier. Second, sweetness and sourness are what you notice first and foremost when tasting wine.
No.
I find the distribution of citric.acid quite unusual. It does not follow a clear distribution and has its mode at 0. Alcohol has several spikes within its distribution, which I did not expect; I would have assumed a much smoother distribution. Maybe winemakers target very specific alcohol levels during vinification.
Fortunately, the data set was already tidy. I removed the X variable, because it served just as a numbering for each observation.
We examine the correlation between each pair of variables.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Let’s visualize these values in a correlation matrix:
Quality has the strongest correlation with alcohol, followed by volatile.acidity, sulphates, and citric.acid. Volatile.acidity is the only feature in this group with a negative correlation with quality. Surprisingly, residual.sugar has essentially no correlation with quality (0.01) and at most a weak one with the other features (the largest is 0.36 with density).
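The ranking can be reproduced from the quality column of the correlation table. The sketch below re-enters those eleven coefficients by hand and sorts them by absolute value; the variable names are illustrative.

```r
# Correlation of each feature with quality, copied from the table above.
cor_with_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by strength of the relationship, ignoring the sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
top4 <- names(ranked)[1:4]

round(ranked, 2)
```

Sorting by absolute value matters here: a plain descending sort would push volatile.acidity to the bottom even though it is the second strongest correlate.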
Maybe a scatterplot matrix of the variables of interest can give us more insights:
The scatterplots nicely illustrate the lack of a relationship between residual.sugar and the other variables, in particular quality. With the help of the linear smoothing functions, we can also visually identify the influence of the variables on quality.
##
## Pearson's product-moment correlation
##
## data: redwine$residual.sugar and redwine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
There is relatively low variation of residual sugar. The median of residual.sugar is roughly 2 for each quality score.
##
## Pearson's product-moment correlation
##
## data: redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The boxplot shows that the median percent alcohol content ranges from 9.7% up to 12.15%. It gets higher as the quality increases. The median for wines with quality of 5 or below is around 10%.
##
## Pearson's product-moment correlation
##
## data: redwine$volatile.acidity and redwine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
Volatile.acidity shows an inverse relation with quality. The higher the quality, the lower the median volatile.acidity. The median ranges from 0.37 to 0.845.
##
## Pearson's product-moment correlation
##
## data: redwine$sulphates and redwine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: redwine$citric.acid and redwine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Both sulphates and citric.acid have a positive relation with quality. The relation is less distinct compared to alcohol and volatile.acidity. In addition to that, we observe that the variance within citric.acid is notably higher than within the sulphates variable.
Of all features, alcohol has the strongest correlation with quality (0.48), followed by volatile.acidity (-0.39). There are also weak correlations between quality and sulphates (0.25) and between quality and citric.acid (0.23).
What I found particularly surprising is the lack of correlation between quality and residual.sugar. As stated before, I assumed sugar is important for the taste of the wine and therefore its quality rating. Moreover, residual.sugar has little to no correlation with the other features.
The strongest correlation overall, at -0.68, is between pH and fixed.acidity. This is not surprising, since pH describes how acidic or basic a wine is.
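The test statistics reported by the Pearson tests above follow directly from the correlation coefficients and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. As a small consistency check, the sketch below recomputes them from the printed coefficients; the variable names are chosen for this example.

```r
# Correlations with quality, copied from the test output above.
r <- c(
  alcohol          =  0.4761663,
  volatile.acidity = -0.3905578,
  sulphates        =  0.2513971,
  citric.acid      =  0.2263725
)
deg_free <- 1597  # n - 2 for n = 1599 wines

# t statistic of the Pearson correlation test.
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), deg_free)

round(t_stat, 3)
```

The recomputed statistics agree with the printed values (21.639, -16.954, 10.38, and 9.2875), which shows that with almost 1600 observations even a weak correlation of about 0.23 is clearly distinguishable from zero.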
We identified four features having the strongest correlation with quality in the previous section. Now we investigate the combinations of two of these features and their influence on quality.
It is hard to distinguish different quality scores, thus we use a better color scheme.
## alcohol volatile.acidity quality
## alcohol 1.0000000 -0.2022880 0.4761663
## volatile.acidity -0.2022880 1.0000000 -0.3905578
## quality 0.4761663 -0.3905578 1.0000000
Better wines are at the bottom right of the plot (high alcohol, low volatile acidity), while poorly rated wines tend towards the top left (low alcohol, high volatile acidity).
## alcohol sulphates quality
## alcohol 1.00000000 0.09359475 0.4761663
## sulphates 0.09359475 1.00000000 0.2513971
## quality 0.47616632 0.25139708 1.0000000
Both alcohol and sulphates have a positive correlation with quality, confirming our previous findings.
## alcohol citric.acid quality
## alcohol 1.0000000 0.1099032 0.4761663
## citric.acid 0.1099032 1.0000000 0.2263725
## quality 0.4761663 0.2263725 1.0000000
This plot is similar to the previous one, though with much more variance on the y-axis.
## volatile.acidity sulphates quality
## volatile.acidity 1.0000000 -0.2609867 -0.3905578
## sulphates -0.2609867 1.0000000 0.2513971
## quality -0.3905578 0.2513971 1.0000000
## volatile.acidity citric.acid quality
## volatile.acidity 1.0000000 -0.5524957 -0.3905578
## citric.acid -0.5524957 1.0000000 0.2263725
## quality -0.3905578 0.2263725 1.0000000
This plot is less clear than the others. Given the similar nature of volatile.acidity and citric.acid we assume a strong correlation between them. A quick calculation confirms our suspicion: these variables are correlated with a value of -0.55.
## sulphates citric.acid quality
## sulphates 1.0000000 0.3127700 0.2513971
## citric.acid 0.3127700 1.0000000 0.2263725
## quality 0.2513971 0.2263725 1.0000000
Before we build our regression model based on these features, we have another look at the corresponding correlation table.
## alcohol volatile.acidity sulphates citric.acid
## alcohol 1.00000000 -0.2022880 0.09359475 0.1099032
## volatile.acidity -0.20228803 1.0000000 -0.26098669 -0.5524957
## sulphates 0.09359475 -0.2609867 1.00000000 0.3127700
## citric.acid 0.10990325 -0.5524957 0.31277004 1.0000000
I would argue for leaving citric.acid out of our model, because it is quite strongly correlated with volatile.acidity (-0.55) and, to a lesser degree, with sulphates (0.31). We will build four different models, adding one more feature each time, and include citric.acid in the last one nonetheless to check our intuition.
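One way to quantify this overlap, which is not part of the original analysis, is to ask how much of each predictor's variance the other three predictors explain. For standardized variables this share equals 1 - 1 / d, where d is the corresponding diagonal entry of the inverted correlation matrix (the quantity also known as the variance inflation factor). The sketch below applies this to the correlation table above, with the coefficients re-entered by hand.

```r
predictors <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")

# Correlation matrix of the four predictors, copied from the table above.
R <- matrix(
  c( 1.00000000, -0.20228803,  0.09359475,  0.10990325,
    -0.20228803,  1.00000000, -0.26098669, -0.55249568,
     0.09359475, -0.26098669,  1.00000000,  0.31277004,
     0.10990325, -0.55249568,  0.31277004,  1.00000000),
  nrow = 4, byrow = TRUE,
  dimnames = list(predictors, predictors)
)

# Diagonal of the inverse correlation matrix (variance inflation factors).
vif <- diag(solve(R))

# Share of each predictor's variance explained by the other three.
shared <- 1 - 1 / vif

round(shared, 2)
```

A predictor with a large shared variance adds little independent information, which is the intuition behind dropping citric.acid; alcohol should come out with the smallest share.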
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = redwine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = redwine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = redwine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = redwine)
##
## ================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646***
## (0.175) (0.184) (0.196) (0.201)
## alcohol 0.361*** 0.314*** 0.309*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.221*** -1.265***
## (0.095) (0.097) (0.113)
## sulphates 0.679*** 0.696***
## (0.101) (0.103)
## citric.acid -0.079
## (0.104)
## ----------------------------------------------------------------
## R-squared 0.2 0.3 0.3 0.3
## adj. R-squared 0.2 0.3 0.3 0.3
## sigma 0.7 0.7 0.7 0.7
## F 468.3 370.4 268.9 201.8
## p 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1621.8 -1599.4 -1599.1
## Deviance 805.9 711.8 692.1 691.9
## AIC 3448.1 3251.6 3208.8 3210.2
## BIC 3464.2 3273.1 3235.7 3242.4
## N 1599 1599 1599 1599
## ================================================================
As assumed, citric.acid does not improve our model: its coefficient is not significant, and both AIC and BIC are higher for m4 than for m3. Interestingly, all models have a low R-squared value of about 0.3 at best.
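Because the model table rounds R-squared to one decimal, the differences between the models are hard to see. As a rough sketch, R-squared can be recovered with more digits from the deviance row, which for a linear model is the residual sum of squares. The total sum of squares is not printed, so it is derived here from m1, using the fact that R-squared of a simple regression equals the squared correlation between quality and alcohol. The inputs are themselves rounded, so the results are approximate.

```r
# Residual sums of squares (the "Deviance" row of the model table).
rss <- c(m1 = 805.9, m2 = 711.8, m3 = 692.1, m4 = 691.9)

# Correlation between alcohol and quality, from the correlation table.
r_alcohol <- 0.47616632

# Total sum of squares, derived from m1: RSS = TSS * (1 - r^2).
tss <- unname(rss["m1"] / (1 - r_alcohol^2))

# R-squared of each model.
r_squared <- 1 - rss / tss

# Gain from adding citric.acid to m3.
gain_m4 <- unname(r_squared["m4"] - r_squared["m3"])

round(r_squared, 3)
```

Under these assumptions R-squared rises from roughly 0.23 (m1) to about 0.32 (m2) and 0.34 (m3), while adding citric.acid changes it by less than 0.001.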
All multivariate plots confirm the relationships found earlier. Alcohol in particular serves as a good indicator of quality, one reason being its low correlation with the other predictors.
The plot with volatile.acidity vs citric.acid was surprising, since all the other plots showed a stronger trend. However, the surprise was quickly gone after realizing the similarity and correlation between both variables.
Yes, I built four linear models that successively add alcohol, volatile.acidity, sulphates, and citric.acid. All of these models perform rather poorly in predicting wine quality. If I had to choose one model, it would be m2, which uses alcohol and volatile.acidity as input variables. I prefer it over m3 because it has similar performance while being simpler.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality, our main feature of interest, follows an approximately normal distribution with a slight left-skewness. It has low variance, because over 80% of all wines are rated either 5 or 6. In addition, no wines in this data set have a score of 0, 1, 2, 9, or 10.
## redwine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## redwine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## redwine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## redwine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## redwine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## redwine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
A surprising result: residual.sugar has no correlation with quality. The median amount of sugar is just above 2 grams per litre across all quality scores. Most wines have a residual.sugar value between 1.2 and 3.5 grams per litre.
## Source: local data frame [6 x 2]
##
## quality COR
## (int) (dbl)
## 1 3 0.71739157
## 2 4 0.07559736
## 3 5 -0.01611186
## 4 6 -0.10765046
## 5 7 0.01889942
## 6 8 0.53271120
Alcohol and volatile.acidity are the two features with the strongest influence on quality. The best-rated wines are at the bottom right of the plot, meaning high alcohol percentage and low volatile acidity. Looking at the correlation between alcohol and volatile.acidity within each quality score, we see that it is close to zero except for scores of 3 and 8, which are also the smallest groups (10 and 18 wines). Predictors that are only weakly correlated with each other are desirable for a linear model.
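A per-group correlation like the one in the table above can be computed by splitting the data by quality and applying `cor()` to each piece. The report's output suggests a grouped summary was used; the base R version below is just one possible sketch, and the data frame `toy` contains made-up illustrative values, not rows from the wine data.

```r
# Made-up illustrative values (hypothetical, not taken from the data set).
toy <- data.frame(
  quality          = c(5, 5, 5, 6, 6, 6),
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.2),
  volatile.acidity = c(0.70, 0.66, 0.60, 0.50, 0.58, 0.40)
)

# Split the rows by quality score and correlate within each group.
cor_by_quality <- sapply(
  split(toy, toy$quality),
  function(g) cor(g$alcohol, g$volatile.acidity)
)

cor_by_quality
```

With the real data, `toy` would be replaced by `redwine`. Groups with very few wines, such as scores 3 and 8, give unstable estimates, which is worth keeping in mind when reading the larger coefficients in the table.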
This project was very interesting and challenging. It was more time-consuming than anticipated. Though this project emphasizes quick and dirty exploration, it took some time to write down my thought process and polish it.
I was a bit disappointed with the data set itself. Mainly, I wish it had contained more observations covering all ratings. Secondly, I was under the impression that several obvious features were missing, e.g. age of the wine, climate data, grape variety, or wine maker. Not surprisingly, this led to poor results when building the linear regression models.
For future work, I would search for a richer data set with more observations and more variables. Additionally, a different machine learning approach such as logistic regression or decision trees seems more appropriate for predicting quality, since my exploration could not uncover distinct linear relationships. Another area of improvement could be feature transformation, which I did not apply at all.
So what is the main takeaway? Whenever you want to buy a good bottle of red wine, look out for high alcohol content.